{
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "## 浏览器模拟爬虫笔记\n",
        "\n",
        "**Selenium**\n",
        "\n",
        "- 本来是一个自动化测试模块，但既然它能模仿人机交互动作，后来也就自然地被用到了爬虫里\n",
        "- 因为完全模仿真人动作，所以相对难以识别，但速度自然也慢"
      ],
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 浏览器\n",
        "\n",
        "首先需要一个准备浏览器供 Selenium 控制。以前一般使用 phantomJS ，但后来它停止开发了。现在用的一般是 Chrome ，但要注意安装的 ChromeDriver 与 Chrome 两者的版本必须要对应（谷歌官方有版本对应表）\n",
        "\n",
        "### 基础设置\n",
        "\n启动浏览器前需要先准备好各项设置，就和人工使用浏览器里的设置一样"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Selenium 控制浏览器的核心模块\n",
        "from selenium import webdriver \n",
        "\n",
        "# Options，用以调整各项浏览器设置。每种浏览器的设置使用的 selenium 类不一样，这里只写 Chrome 的\n",
        "from selenium.webdriver.chrome.options import Options "
      ],
      "outputs": [],
      "execution_count": null,
      "metadata": {
        "collapsed": true
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# 一套个人默认的启动参数，或许可以降低一点被检测发现的可能性\n",
        "\n",
        "chrome_options = Options()\n",
        "\n",
        "# 开启无 GUI 的无头模式\n",
        "# chrome_options.add_argument('--headless')\n",
        "\n",
        "# 忽略 ssl 安全错误\n",
        "# chrome_options.add_argument('--ignore-ssl-errors')\n",
        "\n",
        "# 忽略证书错误\n",
        "# chrome_options.add_argument('--ignore-certificate-errors')\n",
        "\n",
        "# 禁用拓展\n",
        "chrome_options.add_argument('--disable-extensions')\n",
        "# 屏蔽插件发现\n",
        "chrome_options.add_argument('--disable-plugins-discovery')\n",
        "# 以上两项可以让 Chrome 不显示‘Chrome 正在受到自动化测试软件的控制’（可能只是掩耳盗铃）\n",
        "\n",
        "# 开启匿名模式\n",
        "chrome_options.add_argument('--incognito')\n",
        "\n",
        "# 用户配置文件路径取默认\n",
        "chrome_options.add_argument('--profile-directory=Default')\n",
        "\n",
        "# 窗口最大化（分辨率受屏幕硬件控制，影响到屏幕截图的效果。如果是服务器运行应该会有别的办法处理这个问题）\n",
        "chrome_options.add_argument('--start-maximized')\n",
        "\n",
        "# 设置 Chrome 默认下载方式，这里可以设置为不弹窗提示，按默认路径直接下载\n",
        "prefs = {'download.prompt_for_download':False, 'download.default_directory':'D:/'}\n",
        "chrome_options.add_experimental_option('prefs', prefs)\n",
        "# 但如果需要再下载时修改文件名、覆盖另存等操作，个人另外使用 AutoIt 配合进行（Windows 环境）\n",
        "\n",
        "# 根据以上全部配置创建一个 driver 实例\n",
        "driver = webdriver.Chrome(chrome_options=chrome_options) "
      ],
      "outputs": [],
      "execution_count": null,
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 浏览器基本操作"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# 启动浏览器，新打开目标网页\n",
        "driver.get('target url')\n",
        "\n",
        "# 可以使用 JavaScript 将滚动条拖动到底\n",
        "js = 'var q=document.documentElement.scrollTop=10000'\n",
        "driver.execute_script(js)\n",
        "\n",
        "# 网页截图\n",
        "driver.save_screenshot('filename')\n",
        "\n",
        "# 关闭浏览器\n",
        "driver.quit()"
      ],
      "outputs": [],
      "execution_count": null,
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Cookies 处理\n",
        "\n经典思路是先用 Selenium 解决验证码登录问题，取得登录 Cookies 后改用纯 HTTP 请求加快爬取速度。从而全过程不需要人工介入同时又保证了速度"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# 获取当前 cookies\n",
        "driver.get_cookies() # 全部获取（包括名称、值等等）\n",
        "cookies = driver.get_cookies()['value'] # 仅取出值"
      ],
      "outputs": [],
      "execution_count": null,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 页面元素定位\n",
        "\n",
        "至少有 8 种定位手段\n",
        "- 一般而言id最理想，不太会因为网站改版而被修改。xpath根据网页结构顺序来定位就很有可能会被改\n",
        "- xpath 可以从 Chrome 的元素审查中右键复制得到\n",
        "\n另外 [Kalaton](https://www.katalon.com/) 的浏览器插件可以记录人工操作，然后直接翻译成 Selenium 定位、操作代码"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# 定位方法在 By 类里\n",
        "from selenium.webdriver.common.by import By\n",
        "\n",
        "driver.find_elemnt_by_ # 后接以下模式。\n",
        "\n",
        "# class_name = 'class name'\n",
        "# css_selector = 'css selector'\n",
        "# id = 'id'\n",
        "# link_text = 'link text'\n",
        "# name = 'name'\n",
        "# partial_link_text = 'partial link text'\n",
        "# tag_name = 'tag name'\n",
        "# xpath = 'xpath'"
      ],
      "outputs": [],
      "execution_count": null,
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 操作链\n",
        "\nActionChains，将各种鼠标键盘操作串成操作链，再一次性批量执行"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "from selenium.webdriver.common.action_chains import ActionChains\n",
        "import random\n",
        "import time\n",
        "\n",
        "# 使用随机时间延迟模仿真人鼠标操作的移动、悬停、点击（移动轨迹应该还有更多办法可以伪装）\n",
        "def human_click(element, driver=driver, sleeptime=random.uniform(1,2)):\n",
        "    actions = ActionChains(driver) # 实例化\n",
        "    actions.move_to_element(element) # 移动鼠标\n",
        "    actions.click(element) # 鼠标点击\n",
        "    actions.perform() # 执行整个操作链\n",
        "    time.sleep(sleeptime)\n",
        "    \n",
        "# 还可以实现更多的如按住（clickAndHold）、拖拽（dragAndDrop）等操作，详见文档"
      ],
      "outputs": [],
      "execution_count": null,
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Keys 键盘操作\n",
        "from selenium.webdriver.common.keys import Keys\n",
        "\n",
        "# 以一个输入框为例\n",
        "input_box = driver.find_element_by_id('input_box_id')\n",
        "input_box.clear() # 清空输入框，但个人猜测这种操作方式不那么地像真人\n",
        "input_box.send_keys('123456')\n",
        "input_box.send_keys(Keys.ENTER) # 回车。其他每个按键不一定都有名字，但肯定有一个特殊代码来指代，见 Selenium 文档\n",
        "input_box.send_keys(Keys.TAB) # 制表符，常用于跳转用户名、密码输入框\n",
        "\n",
        "# 组合键\n",
        "input_box.send_keys(Keys.CONTROL, 'a') # Ctrl+A 全选\n",
        "\n# 按住，顺序按下、释放等复杂操作要配合 ActionChains 处理"
      ],
      "outputs": [],
      "execution_count": null,
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 状况处理\n",
        "\n常用于判断页面的表现是否符合预期，以便决定何时进行下一步操作"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# expected_conditions 预期状况\n",
        "from selenium.webdriver.support import expected_conditions as EC \n",
        "\n",
        "# WebDriverWait 条件等待,经常与上面的状况预测配合使用\n",
        "from selenium.webdriver.support.wait import WebDriverWait\n",
        "\n",
        "# 确认url是否已经跳转，没跳转就继续执行操作（某些网站登陆后 url 肯定变化，就可以以此来确定是否登陆成功）\n",
        "login_check = EC.url_to_be('url_before_login') # 注意是 before\n",
        "while login_check(driver): # 注意需要 driver 实例作为参数\n",
        "    'do something' # 经典用法是继续尝试登录（验证码刷新继续尝试识别）\n",
        "    pass\n",
        "\n",
        "# 等待某个页面元素消失，timeout 参数可设置最大超时，默认不设限制\n",
        "# 常用于等待表示‘loading’的页面元素消失（即页面加载完成后再操作）\n",
        "# 当然也有相反的until操作，详见文档\n",
        "WebDriverWait(driver, timeout=100).until_not(EC.visibility_of(element_you_dont_want_to_see))\n",
        "\n",
        "# 此外还可配合更多的预期状况进行判断，如 clickable 等，详见文档\n"
      ],
      "outputs": [],
      "execution_count": null,
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 配合 OCR 处理验证码\n",
        "\nPytesseract 是 Google 开源 OCR 程序 Tesseract 的 Python 接口，简单训练后足够用于应付比较规整的数字验证码"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import pytesseract\n",
        "from PIL import Image,ImageEnhance # 用于图片预处理\n",
        "\n",
        "# 获取图片并识别\n",
        "# 这里使用的是截图方法，但这就与截图时的窗口大小和屏幕分辨率有关，很不稳定。最好有直接定位元素截取的方法\n",
        "def get_captcha(rangle):\n",
        "    driver.save_screenshot('filename')\n",
        "\tcaptcha = Image.open('filename')\n",
        "\tcaptcha = captcha.crop(rangle) # 图片裁剪，参数为一个四维 tuple，图片坐标从左上角开始向右下单增，可用画图软件观察\n",
        "    # 顺序是（左上横坐标，左上纵坐标，右下横坐标，右下纵坐标）\n",
        "\tcaptcha = captcha.convert('L') # 转为灰度\n",
        "\tcaptcha = ImageEnhance.Contrast(captcha) \n",
        "\tcaptcha = captcha.enhance(2.0) # 提高对比度，2.0在这里是个 magic number\n",
        "\t# 设置二值化的转换阈值，这里取的143也是一个 magic number\n",
        "\tcaptcha = captcha.point(lambda x:0 if x<143 else 255)\n",
        "\tcaptcha_path = 'savefile'\n",
        "\tcaptcha.save(captcha_path)\n",
        "\tnum = pytesseract.image_to_string(captcha, lang=\"你的tesseract字体文件名\") # 训练后的字体文件放入对应文件夹\n",
        "    # 不指定则使用默认字体（没有实际对比测试过识别率，但应该自己训练过的会好一些）\n",
        "\n\treturn num"
      ],
      "outputs": [],
      "execution_count": null,
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 按键程序\n",
        "\n点击下载按钮后弹出的 os 对话框无法通过 Selenium 处理，要么最初就设置成不弹窗直接下载，要么只能靠调用外部程序来操作。比如 AutoIt（os 窗口元素定位），pyautogui（屏幕像素定位）"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# subprocess 和 os 模块差不多\n",
        "import subprocess\n",
        "\n",
        "# 调起外部程序，可以通过str.format等文本格式化方式传入命令行参数\n",
        "subprocess.call('programname {0}'.format('param'))\n",
        "# AutoIt 程序相对简单，写在另一个 .au3 文件里"
      ],
      "outputs": [],
      "execution_count": null,
      "metadata": {}
    }
  ],
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "language": "python",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python",
      "version": "3.6.5",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    },
    "kernel_info": {
      "name": "python3"
    },
    "nteract": {
      "version": "0.12.3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}